Code by TomMakesThings

About

The purpose of this notebook is to explore different NLP techniques, including normalization, tokenization, stop word removal, stemming, lemmatization, multi-hot encoding, word vector embedding and classification through a neural network. Rather than experimenting with the whole dataset, an LSTM classifier is created and trained on 5000 randomly selected samples, so its accuracy is limited. However, later in the project the text pre-processing steps and the classifier's parameters were improved, and the model was converted into a pipeline in another notebook.

Imports

This notebook has been developed using Python 3.8.5 and Anaconda3 with conda 4.10.1. If you would like to recreate the environment, the YAML file environment.yml can be found on GitHub. Using this, it is possible to recreate the environment using the command conda env create -f environment.yml through the Anaconda terminal. For more detail refer to the conda docs.

IMDb Dataset

Open the dataset, drop irrelevant columns and remove samples with missing information.

Analysing the Dataset

Metrics

Each film has between one and three genres stored in genre. However, they are stored in the same column, meaning some processing is required to separate them and count the true number of unique genres. For example, the column values 'Drama', 'Romance' and 'Drama, Romance' would otherwise be counted as three different genres, even though only two unique genres are present.
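The splitting-before-counting idea can be sketched as follows, using a hypothetical slice of the genre column:

```python
from collections import Counter

# Hypothetical values from the genre column, where each entry holds
# one to three comma-separated genres.
raw_genres = ["Drama, Romance", "Drama", "Romance", "Comedy"]

# Counting column values directly treats "Drama, Romance" as its own
# category, inflating the apparent number of genres.
naive_counts = Counter(raw_genres)

# Splitting on commas first recovers the true per-genre counts.
genre_counts = Counter(
    genre.strip() for entry in raw_genres for genre in entry.split(",")
)

print(len(naive_counts), len(genre_counts))  # 4 distinct values, 3 unique genres
```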

Print information about the dataset:

Graphs

Scattertext graph of word occurrence between genres

This produces a visual comparison between frequent words from a given genre and all other genres. For example, when genre = 'Horror', the graph shows that words such as 'blood', 'rescue' and 'children' appear frequently in horror film descriptions, but infrequently in other genres. These words are therefore useful indicators for identifying this genre. Similarly, words such as 'life' and 'mother' do not commonly appear in horror film descriptions, so their presence indicates a film is unlikely to be horror. Words such as 'film' and 'friends' appear to be common across many genres and so are unlikely to be helpful for identification. These could be removed along with regular stop words such as 'the' and 'of'.

Word frequency distribution of a given genre

Plot the most common words for a given genre. The first graph includes all words, though it is dominated by stop words such as 'a', 'the' and 'to'. The second graph has stop words removed, leaving words more characteristic of the genre, e.g. "house" and "killer" for horror. The frequency distribution provides an interesting comparison against topics assigned through unsupervised topic modelling algorithms.

N-gram distribution

The first graph below displays the most common bigrams across all genres. However, it is dominated by common stop word pairs. As stop words will be removed, I also plotted the most frequent bigrams after removing stop words. This reveals common characters such as "young man", "best friend" and "serial killer", as well as common places such as "high school" and "new york".

Now I plot the most common trigrams. Again the first graph is dominated by stop words, while the second reveals common themes such as "world war ii", "new york city" and "based on a true story".
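For reference, n-gram extraction itself needs only a small helper; this is a minimal sketch rather than the notebook's actual plotting code:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "based on a true story".split()
print(ngrams(tokens, 2))  # bigrams: ('based', 'on'), ('on', 'a'), ...
print(ngrams(tokens, 3))  # trigrams: ('based', 'on', 'a'), ...
```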

Label distribution

View the number of films belonging to each genre. Across all samples, the most common genres are Drama (26.7%), Comedy (16.4%) and Romance (8.0%). The least common is News, with only one sample across the whole dataset. Other rare genres include Adult and Documentary, which have two samples each, and Reality-TV, which has three. The next rarest genre is Film-Noir with 663. The rarest genres are removed later in the notebook as they do not provide enough samples to accurately train a classifier.

Visualising word relationships with displaCy

The graph below shows syntactic dependencies and part-of-speech tags. This gives a visual representation of the relationships between words in a description that should be captured by the LSTM.

Drop samples

As the dataset is large, samples are dropped so that the notebook can run in a reasonable time. n_samples specifies the maximum number of samples to use. Setting n_samples = 0 will cause all suitable samples to be used, though the notebook will be slow to run.

min_length specifies the minimum number of words in a description; samples with fewer words than this will be removed.

Setting remove_rare_genres = True will remove genres with fewer than rare_count instances from the movies' labels. If a sample does not have a label for any remaining genre, it will be dropped. For example, if rare_count = 10, the genres [News, Adult, Documentary, Reality-TV] are removed.
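The rare-genre filter can be sketched like this, assuming a hypothetical list of (description, genres) pairs in place of the notebook's dataframe:

```python
from collections import Counter

def drop_rare_genres(samples, rare_count):
    # Count how often each genre appears across all samples.
    counts = Counter(g for _, genres in samples for g in genres)
    rare = {g for g, c in counts.items() if c < rare_count}
    kept = []
    for description, genres in samples:
        remaining = [g for g in genres if g not in rare]
        if remaining:  # drop samples left with no genre label
            kept.append((description, remaining))
    return kept

samples = [("d1", ["Drama", "News"]), ("d2", ["News"]),
           ("d3", ["Drama", "Comedy"]), ("d4", ["Drama"])]
# With rare_count = 3, News (2 samples) and Comedy (1 sample) are removed,
# and "d2" is dropped entirely as it has no remaining label.
print(drop_rare_genres(samples, rare_count=3))
```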

max_genre_samples is the maximum number of samples of each genre allowed. For example if max_genre_samples = 100, a maximum of 100 samples are allowed for Comedy, another 100 for Drama etc. This can be turned off by setting max_genre_samples = 0.

Samples

Stop words

Create a custom stop word list movie_stop_words of the most common words across all movie descriptions.
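A minimal sketch of how such a list can be built, with a couple of made-up descriptions standing in for the dataset:

```python
from collections import Counter

# Hypothetical descriptions; in the notebook these come from the dataset.
descriptions = [
    "A young man falls in love in New York",
    "A young woman discovers a dark secret",
]
tokens = [word.lower() for d in descriptions for word in d.split()]

# Take the n most frequent words across all descriptions as custom stop words.
n_custom = 3
movie_stop_words = [word for word, _ in Counter(tokens).most_common(n_custom)]
print(movie_stop_words)
```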

Select which stop word list to use and print it. Set stop = 0 to use no pre-defined stop word list, stop = 1 to use NLTK, or any other value, e.g. stop = -1, for spaCy. Set add_custom_words = True to add the movie_stop_words to the stop words.

Here I have used NLTK to filter out common words that do not contribute to the meaning of the descriptions. Although movie_stop_words could be appended, I decided against this because, although words such as love and young are common in most genres, they may still help distinguish some genres, for example romance from horror.

Normalisation

Create a function to convert accented characters. For example, è becomes e in "Arsène Baudu and Hyacinthe, a pair of small-time crooks".
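Such a function can be written with only the standard library, for example:

```python
import unicodedata

def strip_accents(text):
    # Decompose accented characters (è -> e + combining grave accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents("Arsène Baudu and Hyacinthe"))  # Arsene Baudu and Hyacinthe
```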

Load a model to expand contractions, e.g. can't becomes cannot in "They fall in love, but can't quite seem to get the timing right."
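A toy lookup-table version of contraction expansion, standing in for the model the notebook loads:

```python
# A tiny stand-in table; a real contraction model covers far more cases
# and preserves capitalisation.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def expand_contractions(text):
    return " ".join(CONTRACTIONS.get(word.lower(), word) for word in text.split())

print(expand_contractions("They fall in love, but can't quite seem to get the timing right."))
```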

Processing samples

Process film descriptions and store as description_docs

  1. Normalisation by removing accents and expanding contractions
  2. Split descriptions into sentences and perform tokenisation
  3. Remove punctuation
  4. Remove stop words (optional)
  5. Correct spelling (optional)
  6. Apply lemmatizing or stemming (optional)
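The core of steps 2–4 can be sketched as follows, with whitespace splitting and a tiny stop word set standing in for spaCy tokenisation and the NLTK list:

```python
import string

STOP_WORDS = {"the", "a", "to", "of", "and", "in"}  # tiny stand-in list

def preprocess(description, remove_stops=True):
    # 2. Tokenise (whitespace split as a stand-in for spaCy tokenisation).
    tokens = description.lower().split()
    # 3. Strip punctuation attached to tokens and drop empty results.
    tokens = [t.strip(string.punctuation) for t in tokens]
    tokens = [t for t in tokens if t]
    # 4. Optionally remove stop words.
    if remove_stops:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(preprocess("They fall in love, but can't quite get the timing right."))
```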

I decided to use lemmatization rather than stemming because tokens are embedded with pre-trained GloVe vectors before they are input into the LSTM. As GloVe was trained on full English words, lemmatization is more suitable as it keeps words in dictionary form using PoS tagging. By contrast, stemming reduces words to their stems, e.g. "ponies" -> "poni", which are less likely to be included in the pre-trained embeddings.
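To illustrate why stems fall outside GloVe's vocabulary, here is a toy suffix-stripping stemmer, for illustration only; a real stemmer such as NLTK's PorterStemmer applies many more rules:

```python
def toy_stem(word):
    # A deliberately crude Porter-style suffix rule.
    if word.endswith("ies"):
        return word[:-3] + "i"  # "ponies" -> "poni", not a dictionary word
    if word.endswith("s"):
        return word[:-1]
    return word

print(toy_stem("ponies"))  # poni
print(toy_stem("films"))   # film
```

The stem "poni" has no GloVe vector, whereas the lemma "pony" does, which is the reason lemmatization was preferred.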

Here I have converted text into 1-grams, although common n-grams found in GloVe, such as "new-york", could be added as this would help the LSTM pick up on word dependencies.

Even though I added pyspellchecker to correct spelling mistakes, I did not use it because it massively increases processing time.

Again I have a filter to remove samples if they contain fewer words than min_words. This is because removing stop words in the code above can leave some descriptions too short to be suitable for model training.

Labels

Encode the labels (genres) of each film as a binary multi-hot representation using binary_encoder. This allows comparison between the LSTM's output probabilities and each label. The binary encoder is saved to the file specified by binary_encoder_file so that it can be reloaded later.
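The multi-hot representation itself is simple; a pure-Python sketch with a hypothetical four-genre vocabulary (binary_encoder stores the real one):

```python
# Hypothetical genre vocabulary; the real encoder is fitted on the dataset.
genres = ["Comedy", "Drama", "Horror", "Romance"]

def multi_hot(labels, classes=genres):
    # One position per genre: 1 if the film has that label, else 0.
    return [1 if g in labels else 0 for g in classes]

print(multi_hot(["Drama", "Romance"]))  # [0, 1, 0, 1]
```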

View the distribution of labels for selected samples:

Note the counts do not add to the number of samples as genres are multi-label.

The classes are imbalanced, which can be a problem as it means the classifier can over-predict dominant classes, such as Drama, and under-predict minority ones, such as Horror. However, a weighted value for each label can be used to scale the loss calculation so that the model is penalized more if it misclassifies a minority genre. Here this is calculated as the number of label counts without a genre divided by the total number of label counts.
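That calculation can be sketched directly, with made-up label counts:

```python
from collections import Counter

# Hypothetical label counts across the selected samples.
label_counts = Counter({"Drama": 500, "Comedy": 300, "Horror": 50})
total = sum(label_counts.values())

# Weight per genre: label counts without the genre over total label counts,
# so rarer genres receive weights closer to 1.
class_weights = {g: (total - c) / total for g, c in label_counts.items()}
print(class_weights)  # Horror receives a higher weight than Drama
```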

Create new dataframe of processed samples

The dataframe below contains the fields used later in the notebook to construct a TorchText Dataset.

Supervised Classification

Set up PyTorch to make the results semi-reproducible.

Classifier class

Define a class for the LSTM classifier.

Set up training, testing and validation data

Create TorchText Fields defining the ID, description and multi-hot label of each film. These correspond to the columns in processed_samples. Then construct a TorchText dataset from the films in processed_samples.

Split data

Separate dataset samples into training, testing and validation sets in the ratio 0.49 : 0.3 : 0.21. This means that 70% of the data (training plus validation) is used during model training, while 30% is kept unseen for evaluation.
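A reproducible index split in that ratio might look like this; a sketch only, as the notebook splits the TorchText dataset rather than raw indices:

```python
import random

def split_indices(n, ratios=(0.49, 0.3, 0.21), seed=0):
    # Shuffle indices reproducibly, then cut into train / test / validation.
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(ratios[0] * n)
    n_test = int(ratios[1] * n)
    return (indices[:n_train],
            indices[n_train:n_train + n_test],
            indices[n_train + n_test:])

train, test, val = split_indices(5000)
print(len(train), len(test), len(val))  # 2450 1500 1050
```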

To use k-fold cross-validation with k folds, set use_k_folds = True. For example, if k = 5, training data will be split 5 different ways. If k-folds is not used, data will be randomly split. I used cross validation when training the final model, but not when making small changes during model development. This is because it takes a long time to run as it trains k different models.

Set word embeddings of tokens

Construct a vocabulary for the TEXT (description) field. trained_vector is set as a pre-trained word embedding vector, such as:


For this project I decided to use GloVe. This is because:

  1. Tokens have not been processed in the correct format to use CharNGram, e.g. ['2gram-ok', '3gram-#BEGIN#na', '4gram-ess#END#'].
  2. Although simple FastText is also suitable, fewer words from the dataset are in the pre-trained embedding (see section below). This is perhaps because GloVe was trained on Wikipedia articles, some of which are about films. By contrast, FastText was trained on news articles, which are less likely to be about niche films.

View untrained tokens

View all tokens that have not been given a pre-trained vector word embedding, i.e. the embeddings the model will have to learn from scratch. When using "glove.6B.100d", fewer words are missing from the embedding compared with "fasttext.simple.300d". It is better if this number is lower, as it means the embedding is better suited to the data. Therefore, GloVe has been selected.

Looking at these tokens is also useful for seeing how token pre-processing could be improved so that more of them are included. For example, during development I noticed that tokens such as "(years" were displayed here because spaCy's token.is_punct did not remove punctuation within tokens. After stripping this punctuation, more words were mapped to a pre-trained embedding and the model's accuracy improved.

Another observation is that many of these words contain spelling mistakes, such as "neiborhood", "charachter" and "athiest". Therefore I added a spell corrector to the pre-processing. However, I have not used it here as spell correction is computationally expensive.

Create a new classifier

Set the model parameters:

Train the classifier

The function below returns a tensor of weights that can be used to create a weighted loss function. Applying these weights makes the loss function assign a high penalty if the model does not predict an expected label, and a lower penalty otherwise. This was added because the model was converging to predicting all zeros as a way to minimise loss, meaning it was not learning to predict any genre for the majority of descriptions.
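The effect of such weights on binary cross entropy can be shown on probabilities; this is a sketch of the idea behind a pos_weight-style term, not the notebook's exact function:

```python
import math

def weighted_bce(pred, target, pos_weight):
    # Binary cross entropy on a single probability, with the positive
    # (label-present) term scaled by pos_weight.
    return -(pos_weight * target * math.log(pred)
             + (1 - target) * math.log(1 - pred))

# Predicting 0.1 when the label is present is penalised far more with
# pos_weight = 5, discouraging the all-zeros solution.
print(weighted_bce(0.1, 1, pos_weight=1.0))
print(weighted_bce(0.1, 1, pos_weight=5.0))
```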

These weights are calculated if weighted_loss = True when calling the training function train_model. Optionally, class_weights can be added as a scaling factor so that the loss function is weighted towards giving a higher loss for incorrectly predicted rarer genres.

Below is the function used to train the model. The best model is saved if a file name is given, e.g. file = 'film_classifier.pt'. The best model is determined by measure, which can be set to either loss or accuracy. For example, if measure = "accuracy", the weights of the model with the highest validation accuracy will be saved.

Initially, I used cross entropy to evaluate predicted labels. However, this only let me evaluate performance against one matching label, e.g. drama, even though some films have up to three genres, e.g. drama, comedy, sci-fi. This meant the problem was being treated as multi-class instead of multi-label, so the model was sometimes penalised even though its prediction was correct. To solve this, the labels are now converted into a multi-hot binary encoding so that the multi-label representation is captured, and the loss function was changed to binary cross entropy.

Model Arguments

Set the model arguments as dictionary form. Then save the arguments to a file specified by model_kwargs_file.

Perform k-fold cross validation

If use_k_folds = True, train k models over a small number of epochs to find the best train-validation data split. Then the split with the highest validation accuracy is selected to train the final model.

Train the classifier

Training, testing and validation data are split into bucket iterators. This allows data to be separated into batches of size batch_size to perform mini-batch gradient descent. If k-fold cross validation was performed, then the fold with the lowest validation loss is used.

Create the classifier model and set its parameters. Move the model from CPU to GPU if possible as this will speed up training.

Define file names best_state_file and final_state_file to save the state of the model that performs best on validation data, and the final state of the model. Then create a new optimiser and train the classifier. To use regularization, set L2_penalty to a value such as 1e-5, or set it to 0 for no regularization. When calculate_f1 = True, the F1 score will also be recorded, though this will make training slightly slower. This is a measure that balances precision and recall and is useful for unbalanced datasets.

Plot the results

Plot the training loss and accuracy per epoch. If calculate_f1 = True, a graph will also be shown for F1 score.


In this experiment, the model overfits after about 15 epochs, as shown by the training / validation loss. The validation accuracy also levels off after this point. The highest validation accuracy is 27.2% at epoch 43, so this is the state of the model saved to file. The validation F1 score is also highest at epoch 43, which means this model state has the best balance between precision and recall.

Evaluate the classifier

The function below returns the testing loss, F1 score, accuracy and multi-label confusion matrices.

Test the classifier's performance on unseen data. The state of the model to load can be changed through setting model_state_to_load:

Confusion matrix

Plot confusion matrices for each genre. These show:

Prediction

The function below takes a custom text input, pre-processes the text, converts it into a form suitable for the classifier and returns the predicted genres.

Enter an IMDb film / series description and see the predicted genre(s).

For example: